Beyond the profile: adventures in machine learning and automation
Julian Flowers
15/09/2017
A bit about data science…
- include a couple of charts about data maturity cycle
Profiles
- There are currently 67 profiles with 295 domains
- 1,662 unique indicators
- The most frequently used indicators are: Index of multiple deprivation score (IMD 2015); % of women who smoke at time of delivery; Prevalence of overweight (including obese) among children in Reception; Prevalence of overweight (including obese) among children in Year 6; Rate of conceptions per 1,000 females aged 15-17; Rate of chlamydia detection per 100,000 young people aged 15 to 24; 2.02i - % of all mothers who breastfeed their babies in the first 48hrs after delivery; Smoking Prevalence in adults - current smokers (APS); 3.03x - % of eligible children who have received two doses of MMR vaccine on or after their 1st birthday and at any time up to their 5th birthday; 1.02i - School Readiness: all children achieving a good level of development at the end of reception as a percentage of all eligible children; 3.03xii - Population vaccination coverage for one dose (females 12-13 years old) - HPV; Hospital admissions for alcohol-specific conditions, under 18s, crude rate per 100,000 population.
- That is ~56,600 spine charts
- (Not including the ~120,000 spine charts in practice profiles)
What if I asked you…?
- Within any one profile or domain
- How similar are the spine charts?
- Can we group local authorities on the basis of their profile?
- What stories do the data tell?
- Even the “skinniest” domain has at least 5 indicators - some have 50+
- This is complex multivariate data
Automation
- Shiny document
- For loop
- Parameterisation
- R Markdown
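
The loop-plus-parameters idea above can be sketched in a few lines. The talk's own tooling is R Markdown; this is only a minimal Python illustration of the pattern, and the template text, area names, and values are all made up for the example.

```python
from string import Template

# Hypothetical report template: $area, $indicator and $value are the parameters.
REPORT_TEMPLATE = Template(
    "Profile report for $area\n"
    "Indicator: $indicator\n"
    "Value: $value\n"
)


def render_report(area, indicator, value):
    """Fill the template for one area (the parameterisation step)."""
    return REPORT_TEMPLATE.substitute(area=area, indicator=indicator, value=value)


# Made-up illustrative values, not real Fingertips data.
example_values = {"Area A": 9.1, "Area B": 6.4, "Area C": 7.8}

# The for loop: one rendered report per area.
reports = {}
for area, value in example_values.items():
    reports[area] = render_report(area, "Example indicator", value)

for text in reports.values():
    print(text)
```

The same template is written once and reused for every area, which is what makes producing many reports cheap.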
Fingertips API and fingertipsR
- API = Application Programming Interface
- Websites for computers
- Fingertips API
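
To make "websites for computers" concrete: a program asks for a URL and gets structured data (often JSON) back instead of a page to read. The sketch below is an assumption-laden illustration, not the Fingertips API itself: the base URL, the `profiles` endpoint name, the `page` parameter, and the payload are all hypothetical, and no network call is made.

```python
import json
from urllib.parse import urlencode

# Hypothetical base URL; this is NOT the real Fingertips API address.
BASE_URL = "https://example.org/api"


def build_url(endpoint, **params):
    """Compose a request URL from an endpoint name and query parameters."""
    url = f"{BASE_URL}/{endpoint}"
    query = urlencode(sorted(params.items()))
    return f"{url}?{query}" if query else url


# A made-up JSON payload standing in for what an API might return.
sample_response = (
    '[{"id": 1, "name": "Example profile A"},'
    ' {"id": 2, "name": "Example profile B"}]'
)


def parse_profiles(payload):
    """Turn the JSON text into an {id: name} lookup."""
    return {item["id"]: item["name"] for item in json.loads(payload)}


url = build_url("profiles", page=1)
profiles = parse_profiles(sample_response)
print(url)
print(profiles)
```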

fingertipsR

Machine learning
- 2 types: supervised and unsupervised
- Train, test, validate
- 5 questions
- How much? (regression)
- Is it A or B? (classification)
- Is it weird? (outliers and anomalies)
- Are there patterns in the data? (clustering = unsupervised ML)
- So what - what next?
- Hundreds of algorithms
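
The "train, test, validate" step in the list above amounts to shuffling the rows once and cutting them into three non-overlapping sets. A minimal sketch, assuming a 60/20/20 split and a fixed seed (both are illustrative choices, not figures from the talk):

```python
import random


def train_validate_test_split(rows, train=0.6, validate=0.2, seed=42):
    """Shuffle a copy of the rows and cut it into three disjoint parts.

    Whatever is left after the train and validate shares becomes the test set.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = round(len(shuffled) * train)
    n_validate = round(len(shuffled) * validate)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_validate],
        shuffled[n_train + n_validate:],
    )


rows = list(range(100))  # stand-in for 100 observations
train_set, validate_set, test_set = train_validate_test_split(rows)
print(len(train_set), len(validate_set), len(test_set))
```

The model is fitted on the training set, tuned against the validation set, and scored once on the test set, so the final score is not flattered by data the model has already seen.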
Diabetes data

Correlations
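
As a rough illustration of what sits behind a correlation chart, the sketch below computes pairwise Pearson correlations between indicators. Pearson is an assumption here (the slide does not say which coefficient was used), and the indicator names and values are made up, not diabetes data.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Nested dict of pairwise correlations: matrix[a][b]."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Made-up values: one list per indicator, one entry per area.
indicators = {
    "indicator_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "indicator_b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "indicator_c": [5.0, 4.0, 3.0, 2.0, 1.0],
}

corr = correlation_matrix(indicators)
for name, row in corr.items():
    print(name, {other: round(r, 2) for other, r in row.items()})
```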

Correlation network
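
One common way to build a correlation network is to treat each indicator as a node and draw an edge wherever the absolute correlation passes a threshold. The sketch below assumes that approach; the 0.7 threshold, the indicator labels, and the correlation values are all hypothetical.

```python
def correlation_edges(corr, threshold=0.7):
    """List (a, b, r) for each pair of indicators with |r| >= threshold."""
    names = sorted(corr)
    edges = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:  # each unordered pair once, no self-loops
            r = corr[a][b]
            if abs(r) >= threshold:
                edges.append((a, b, r))
    return edges


def neighbours(edges):
    """Adjacency view of the network: node -> set of linked nodes."""
    adjacency = {}
    for a, b, _ in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


# Made-up symmetric correlation matrix for four hypothetical indicators.
corr = {
    "a": {"a": 1.0, "b": 0.9, "c": -0.8, "d": 0.1},
    "b": {"a": 0.9, "b": 1.0, "c": 0.2, "d": 0.3},
    "c": {"a": -0.8, "b": 0.2, "c": 1.0, "d": 0.0},
    "d": {"a": 0.1, "b": 0.3, "c": 0.0, "d": 1.0},
}

edges = correlation_edges(corr)
network = neighbours(edges)
print(edges)
print(network)
```

Indicators with no strong correlation to any other (here `d`) drop out of the network entirely, which is part of what makes the picture easier to read than the full matrix.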

Clustering

What (if anything) is different about North Essex?
kmeans
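
For readers unfamiliar with k-means, here is a minimal sketch of the standard (Lloyd's) algorithm: assign each point to its nearest centroid, move each centroid to the mean of its points, and repeat. It is a toy standard-library version on made-up two-dimensional points, not the analysis from the talk, and the seed and iteration count are arbitrary.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=20, seed=1):
    """Return (labels, centroids) after a fixed number of Lloyd iterations."""
    centroids = random.Random(seed).sample(points, k)  # random initial centroids
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every point.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Made-up points forming two well-separated groups.
points = [
    (1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
    (8.0, 8.0), (9.0, 9.5), (8.5, 9.0),
]

labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

The cluster numbers themselves are arbitrary; what matters is which points end up sharing a label.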

Data dimensions - dimensionality reduction
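
A widely used form of dimensionality reduction is principal component analysis (PCA), which finds the direction along which the data vary most and projects each row onto it. The slide does not name a method, so PCA is an assumption here; the sketch finds only the first component, by power iteration on the covariance matrix, using made-up two-column data.

```python
import math


def centre(rows):
    """Subtract each column's mean so the data are centred on zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]


def covariance_matrix(centred):
    """Sample covariance matrix of already-centred rows."""
    n = len(centred)
    d = len(centred[0])
    return [
        [sum(r[i] * r[j] for r in centred) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def first_component(cov, iterations=200):
    """Leading eigenvector of cov by power iteration.

    Assumes cov is non-zero and the starting vector is not orthogonal
    to the leading eigenvector.
    """
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Made-up data: the second column is exactly twice the first.
rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0], [5.0, 10.0]]

centred = centre(rows)
component = first_component(covariance_matrix(centred))
# One score per row: two columns reduced to a single dimension.
scores = [sum(a * b for a, b in zip(row, component)) for row in centred]
print(component)
print(scores)
```

Because the two columns here move together perfectly, a single score per row captures all of the variation; real profile data would need more components.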
